Scholarly text is often laden with jargon, or specialized language that divides disciplines. We extend past work that characterizes science at the level of word types, by using BERT-based word sense induction to find additional words that are widespread but overloaded with different uses across fields. We define scholarly jargon as discipline-specific word types and senses, and estimate its prevalence across hundreds of fields using interpretable, information-theoretic metrics. We demonstrate the utility of our approach for science of science and computational sociolinguistics by highlighting two key social implications. First, we measure audience design, and find that most fields reduce jargon when publishing in general-purpose journals, but some do so more than others. Second, though jargon has varying correlation with articles' citation rates within fields, it nearly always impedes interdisciplinary impact. Broadly, our measurements can inform ways in which language could be revised to serve as a bridge rather than a barrier in science.
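The metrics themselves are not reproduced in this abstract; as a minimal sketch of one interpretable, information-theoretic option for scoring discipline-specificity, a term could be scored by one minus the normalized entropy of its usage distribution across fields (the function name, fields, and counts below are all illustrative assumptions, not the paper's):

```python
import math

def field_specificity(term_counts_by_field):
    """1 - normalized entropy of a term's usage distribution across fields:
    1.0 means the term appears in only one field (jargon-like),
    0.0 means it is spread evenly across all fields."""
    counts = [c for c in term_counts_by_field.values() if c > 0]
    total = sum(counts)
    if total == 0 or len(term_counts_by_field) < 2:
        return 0.0
    entropy = -sum((c / total) * math.log2(c / total) for c in counts)
    return 1.0 - entropy / math.log2(len(term_counts_by_field))

# Hypothetical counts of one term's occurrences in three fields.
print(field_specificity({"biology": 950, "physics": 30, "economics": 20}))
```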
Large language models (LLMs) have been shown to be able to perform new tasks based on a few demonstrations or natural language instructions. While these capabilities have led to widespread adoption, most LLMs are developed by resource-rich organizations and are frequently kept from the public. As a step towards democratizing this powerful technology, we present BLOOM, a 176B-parameter open-access language model designed and built thanks to a collaboration of hundreds of researchers. BLOOM is a decoder-only Transformer language model that was trained on the ROOTS corpus, a dataset comprising hundreds of sources in 46 natural and 13 programming languages (59 in total). We find that BLOOM achieves competitive performance on a wide variety of benchmarks, with stronger results after undergoing multitask prompted finetuning. To facilitate future research and applications using LLMs, we publicly release our models and code under the Responsible AI License.
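A hedged usage sketch, not from the paper: the released checkpoints are hosted on the Hugging Face Hub under the bigscience organization, and the smaller bloom-560m variant is used here so the example runs on ordinary hardware:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load one of the smaller released BLOOM variants from the Hub.
tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-560m")
model = AutoModelForCausalLM.from_pretrained("bigscience/bloom-560m")

# Generate a short continuation from a prompt.
inputs = tokenizer("The capital of Senegal is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=10)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```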
Amid growing concern about the reliability and credibility of machine learning research, we present a principled framework for making robust and generalizable claims: the multiverse analysis. Our framework builds on the multiverse analysis (Steegen et al., 2016) introduced in response to psychology's own reproducibility crisis. To efficiently explore high-dimensional and often continuous ML search spaces, we model the multiverse with a Gaussian process surrogate and apply Bayesian experimental design. Our framework is designed to facilitate drawing robust scientific conclusions about model performance, so our approach focuses on exploration rather than conventional optimization. In the first of two case studies, we investigate disputed claims about the relative merits of adaptive optimizers. Second, we synthesize conflicting research on the effect of the learning rate on the generalization gap in large-batch training. For the machine learning community, the multiverse analysis is a simple and effective technique for identifying robust claims, increasing transparency, and taking a step toward improved reproducibility.
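A minimal sketch of the surrogate-modelling loop this abstract describes, under strong simplifying assumptions: a toy one-dimensional search space, a stand-in evaluation function, and plain uncertainty sampling as the acquisition rule (the paper's exact experimental design may differ):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def evaluate_universe(lr):
    """Stand-in for training a model with this hyperparameter and
    returning a performance score (hypothetical)."""
    return np.sin(3 * lr) + 0.1 * np.random.randn()

candidates = np.linspace(0.0, 2.0, 200).reshape(-1, 1)
X = candidates[::50].copy()          # small initial design
y = np.array([evaluate_universe(x[0]) for x in X])

gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
for _ in range(10):
    gp.fit(X, y)
    _, std = gp.predict(candidates, return_std=True)
    idx = np.argmax(std)             # pure exploration: most uncertain point
    X = np.vstack([X, candidates[idx]])
    y = np.append(y, evaluate_universe(candidates[idx][0]))
```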
By providing unprecedented access to computational resources, cloud computing has enabled rapid growth in technologies such as machine learning, whose computational demands incur high energy costs and a commensurate carbon footprint. As a result, recent scholarship has called for better estimates of the greenhouse gas impact of AI: data scientists today do not have easy or reliable access to measurements of this information, which precludes the development of actionable tactics. Cloud providers presenting information about software carbon intensity to users is a fundamental stepping stone towards minimizing emissions. In this paper, we provide a framework for measuring software carbon intensity, and propose to measure operational carbon emissions using location-based and time-specific marginal emissions data per unit of energy. We provide measurements of operational software carbon intensity for a set of modern models covering natural language processing and computer vision, and for a wide range of model sizes, including the pretraining of a 6.1-billion-parameter language model. We then evaluate a suite of approaches for reducing emissions on the Microsoft Azure cloud compute platform: using cloud instances in different geographic regions, using cloud instances at different times of day, and dynamically pausing cloud instances when the marginal carbon intensity is above a certain threshold. We confirm previous results that the geographic region of the data center plays a significant role in the carbon intensity of a given cloud instance, and find that choosing an appropriate region can have the largest impact on reducing operational emissions. We also show that the time of day has a notable impact on operational software carbon intensity. Finally, we conclude with recommendations for how machine learning practitioners can use software carbon intensity information to reduce their environmental impact.
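A hedged sketch of the emissions accounting described above: multiply each interval's energy use by a location- and time-specific marginal carbon intensity, optionally pausing when intensity exceeds a threshold. All numbers and names are illustrative, not the paper's:

```python
def operational_emissions(power_kw, intensity_g_per_kwh, interval_h=1.0,
                          pause_above=None):
    """Return total gCO2 for a job drawing `power_kw` in each interval,
    given per-interval marginal intensity (gCO2/kWh)."""
    total = 0.0
    for p, ci in zip(power_kw, intensity_g_per_kwh):
        if pause_above is not None and ci > pause_above:
            continue  # job paused this interval (simplified: work shifts later)
        total += p * interval_h * ci
    return total

power = [0.3, 0.3, 0.3, 0.3]              # GPU draw per hour, kW (hypothetical)
intensity = [420.0, 510.0, 380.0, 300.0]  # marginal gCO2/kWh (hypothetical)
print(operational_emissions(power, intensity))                   # run straight through
print(operational_emissions(power, intensity, pause_above=500))  # skip the dirtiest hour
```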
Generative language models are trained on diverse, general-domain corpora. However, this limits their applicability to narrower domains, and prior work has shown that continued in-domain training can provide further gains. In this paper, we introduce a method to scale domain adaptation to many diverse domains using a computationally efficient adapter approach. Our method is based on the observation that textual domains partially overlap, and we represent domains as a hierarchical tree structure where each node in the tree is associated with a set of adapter weights. When combined with a frozen pretrained language model, this approach enables parameter sharing among related domains while avoiding negative interference between unrelated ones. It is efficient, with computational cost scaling as O(log(D)) for D domains. Experimental results with GPT-2 and a large fraction of the 100 most-represented websites in C4 show in-domain improvements. We also provide an inference-time algorithm for held-out domains and show that averaging over multiple paths through the tree achieves further gains in generalization while adding only a marginal cost to inference.
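A minimal sketch of the hierarchical-adapter idea under stated assumptions: standard bottleneck adapters, a hand-written four-node tree, and a stand-in for frozen language-model hidden states (none of this is the paper's implementation):

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Residual bottleneck adapter: down-project, nonlinearity, up-project."""
    def __init__(self, dim, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, h):
        return h + self.up(torch.relu(self.down(h)))

# One adapter per tree node, child -> parent; a domain activates the adapters
# on its root-to-leaf path, so related domains share ancestor parameters and
# each example touches only O(log D) adapters.
tree = {"root": None, "web": "root", "news": "web", "sports_news": "news"}
adapters = nn.ModuleDict({name: Adapter(768) for name in tree})

def apply_path(h, leaf):
    path, node = [], leaf
    while node is not None:
        path.append(node)
        node = tree[node]
    for name in reversed(path):   # apply root first, leaf last
        h = adapters[name](h)
    return h

h = torch.randn(2, 768)           # stand-in for frozen LM hidden states
out = apply_path(h, "sports_news")
```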
As artificial intelligence systems become increasingly powerful and pervasive, concerns about machines' morality, or lack thereof, are growing. Yet teaching morality to machines is a formidable task, as morality remains one of the most intensely debated questions among humans, let alone for AI. Existing AI systems deployed to millions of users, however, are already making decisions loaded with moral implications, which poses a seemingly impossible challenge: teaching machines moral sense while humanity continues to grapple with it. To explore this challenge, we introduce Delphi, an experimental framework based on deep neural networks trained directly on descriptive moral judgments, e.g., "helping a friend" is generally good, while "helping a friend spread fake news" is not. Empirical results provide new insights into the promises and limitations of machine ethics. Faced with novel moral situations, Delphi demonstrates strong generalization capabilities, whereas off-the-shelf neural network models exhibit markedly poor judgment, including unjust biases, confirming the need for explicitly teaching machines moral sense. Yet Delphi is not perfect, exhibiting susceptibility to pervasive biases and inconsistencies. Nevertheless, we demonstrate positive use cases for an imperfect Delphi, including using it as a component model within other imperfect AI systems. Importantly, we interpret Delphi's operationalization in light of prominent ethical theories, which leads us to important future research questions.
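Delphi itself is not released as a library here; purely as a hedged illustration of the descriptive-judgment setup, the task can be framed as text-to-text pairs suitable for any seq2seq model (the pairs below illustrate the format only, not Delphi's training data):

```python
# Hypothetical (situation -> judgment) pairs in a text-to-text format.
train_pairs = [
    ("helping a friend", "it's good"),
    ("helping a friend spread fake news", "it's wrong"),
    ("ignoring a phone call during a meeting", "it's okay"),
]
for situation, judgment in train_pairs:
    print(f"input: {situation}\ttarget: {judgment}")
```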
Recent work in NLP has documented dataset artifacts, bias, and spurious correlations between input features and output labels. However, how to tell which features have "spurious" rather than legitimate correlations is typically left unspecified. In this work we argue that for complex language understanding tasks, all simple feature correlations are spurious, and we formalize this notion into a class of problems we call competency problems. For example, the word "amazing" on its own should not give information about a sentiment label independently of the context in which it appears, which could include negation, metaphor, sarcasm, etc. We theoretically analyze the difficulty of creating data for competency problems when human bias is taken into account, showing that realistic datasets will increasingly deviate from competency problems as dataset size increases. This analysis gives us a simple statistical test for dataset artifacts, which we use to show more subtle biases than were described in prior work, including demonstrating that models are inappropriately affected by these less extreme biases. Our theoretical treatment of this problem also allows us to analyze proposed solutions, such as making local edits to dataset instances, and to give recommendations for future data collection and model design efforts that target competency problems.
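A hedged sketch of the kind of statistical test the abstract alludes to: under the competency assumption, a single token carries no label information, so for a balanced binary task its label co-occurrences should look like draws from Binomial(n, 0.5). The frequency cutoff and Bonferroni correction below are illustrative choices, not the paper's:

```python
from collections import Counter
from scipy.stats import binomtest

def flag_artifacts(examples, alpha=0.01):
    """examples: list of (tokens, label) with label in {0, 1}.
    Returns tokens whose label correlation deviates significantly from 0.5."""
    pos, tot = Counter(), Counter()
    for tokens, label in examples:
        for tok in set(tokens):
            tot[tok] += 1
            pos[tok] += label
    flagged = []
    for tok, n in tot.items():
        if n < 20:
            continue  # too rare for a stable test
        p = binomtest(pos[tok], n, 0.5).pvalue
        if p < alpha / len(tot):  # crude Bonferroni correction
            flagged.append((tok, pos[tok] / n, p))
    return sorted(flagged, key=lambda t: t[2])
```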
Missing values are a common problem in data science and machine learning. Removing instances with missing values can adversely affect the quality of further data analysis. This is exacerbated when there are relatively many more features than instances, so the proportion of affected instances is high. Such a scenario is common in many important domains; for example, single nucleotide polymorphism (SNP) datasets provide a large number of features over a genome for a relatively small number of individuals. To preserve as much information as possible prior to modeling, a rigorous imputation scheme is acutely needed. While denoising autoencoders are a state-of-the-art method for imputation in high-dimensional data, they still require enough complete cases to train on, which are often unavailable in real-world problems. In this paper, we consider missing value imputation as a multi-label classification problem and propose Chains of Autoreplicative Random Forests. Using multi-label random forests instead of neural networks works well for low-sampled data, as there are fewer parameters to optimize. Experiments on several SNP datasets show that our algorithm effectively imputes missing values based only on information from the dataset itself, and exhibits better performance than standard algorithms that do not require any additional information. Here the algorithm is implemented specifically for SNP data, but it can easily be adapted to other cases of missing value imputation.
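A minimal sketch of the idea, not the paper's exact algorithm: treat each incomplete column as a classification target, predict it with a random forest trained on rows where it is observed, and iterate so earlier imputations feed later ones. It assumes small integer-coded features, e.g. SNP genotypes in {0, 1, 2}, with -1 marking missing cells:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def chain_impute(X, n_iters=2, missing=-1):
    X = X.copy()
    miss = X == missing                   # remember which cells were missing
    cols = [j for j in range(X.shape[1]) if miss[:, j].any()]
    for j in cols:                        # initialize with the column mode
        X[miss[:, j], j] = np.bincount(X[~miss[:, j], j]).argmax()
    for _ in range(n_iters):              # refine, chaining predictions
        for j in cols:
            others = np.delete(X, j, axis=1)
            rf = RandomForestClassifier(n_estimators=50)
            rf.fit(others[~miss[:, j]], X[~miss[:, j], j])
            X[miss[:, j], j] = rf.predict(others[miss[:, j]])
    return X
```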
The literature on machine learning in the context of data streams is vast and growing. However, many of the defining assumptions regarding data-stream learning tasks are too strong to hold in practice, or are even contradictory, such that they cannot be met in the context of supervised learning. Algorithms are chosen and designed based on criteria which are often not clearly stated, for problem settings not clearly defined, and are tested in unrealistic settings and/or in isolation from related approaches in the wider literature. This calls into question the potential for real-world impact of many approaches conceived in such contexts, and risks propagating a misguided research focus. We propose to tackle these issues by reformulating the fundamental definitions and settings of supervised data-stream learning with regard to contemporary considerations of concept drift and temporal dependence; we take a fresh look at what constitutes a supervised data-stream learning task, and reconsider the algorithms that may be applied to tackle such tasks. Reflecting on this formulation and overview, and helped by an informal survey of industrial players dealing with real-world data streams, we provide recommendations. Our main emphasis is that learning from data streams does not impose a single-pass or online-learning approach, or any particular learning regime, and that constraints on memory and time are not specific to streaming. Meanwhile, established techniques for dealing with temporal dependence and concept drift exist in other areas of the literature. For the data streams community, we thus encourage a shift in research focus: from dealing with often-artificial constraints and assumptions on the learning mode, to issues such as robustness, privacy, and interpretability, which are increasingly relevant to learning in data streams in academic and industrial settings.
Variational autoencoders and Helmholtz machines use a recognition network (encoder) to approximate the posterior distribution of a generative model (decoder). In this paper we study the necessary and sufficient properties of a recognition network that allow it to model the true posterior distribution exactly. These results are derived in the general context of probabilistic graphical modelling / Bayesian networks, in which the network represents a set of conditional independence statements. We derive both global conditions, in terms of d-separation, and local conditions for the recognition network to have the desired qualities. It turns out that for the local conditions, the property of perfectness (for every node, all of its parents are joined) plays an important role.
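A worked illustration, not from the paper, of why joined parents matter: with two independent latents and one observed child, conditioning on the child couples the latents (explaining away), so a factorized encoder cannot be exact:

```latex
% Decoder with two independent latents and one observed child:
\[
  p(z_1, z_2, x) \;=\; p(z_1)\, p(z_2)\, p(x \mid z_1, z_2).
\]
% Conditioning on x couples the latents (explaining away), so in general
\[
  p(z_1, z_2 \mid x) \;\neq\; p(z_1 \mid x)\, p(z_2 \mid x),
\]
% and a factorized recognition network q(z_1 | x) q(z_2 | x) cannot be
% exact: the encoder needs an edge between z_1 and z_2, i.e. the parents
% of x must be joined, as the perfectness condition requires.
```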